Archiving, Indexing and Accessing Web Materials: Solutions for large amounts of data

Authors

  • David Minor
  • Reagan Moore
  • Bing Zhu
  • Charles Cowart
Abstract

The archiving of Internet materials presents two major challenges: the dynamic nature of the content and the massive number of individual web pages. These two characteristics impact the choice of methods for storing, indexing and providing fast access to archives of materials retrieved from periodic web crawling activities. At the San Diego Supercomputer Center, we have applied two different technologies for archiving, indexing and accessing web archived materials for two projects: a Library of Congress / San Diego Supercomputer Center joint storage project (LOC/SDSC) and the National Science Digital Library (NSDL) persistent archive. In the LOC/SDSC project, we deployed the Wayback software for indexing and accessing a large archive collection from the Library of Congress. Our experience demonstrates that the Wayback software (release 0.6.0) is mature and suitable for handling small collections. Our modifications to the Wayback machine enabled the management of distributed collections. In the NSDL project, we used the Storage Resource Broker (SRB) data grid to index and archive web materials. This approach enabled the use of distributed and heterogeneous storage systems, while supporting access from multiple types of clients, ranging from web browsers, to workflow systems, to Perl and Python load libraries.

Similar resources

Net-Fli: On-the-fly Compression, Archiving and Indexing of Streaming Network Traffic

The ever-increasing number of intrusions in public and commercial networks has created the need for high-speed archival solutions that continuously store streaming network data to enable forensic analysis and auditing. However, “turning back the clock” for post-attack analyses is not a trivial task. The first major challenge is that the solution has to sustain data archiving under extremely hig...

A Survey of the Indexing Status of the Country's Approved Latin-Language Medical Science Journals in Reputable International Indexes

Background and Aim: Today, journals are one of the main platforms for exchanging information among researchers. This study aimed to assess the indexing status of approved Latin-language medical science journals in the Web of Science and Scopus citation databases. Materials and Methods: This study was a cross-sectional descriptive survey. The statistical population of the study was 83 titles ...

View Planning for Cityscape Archiving and Visualization

This work explores full registration of scenes in a large area based purely on images, for city indexing and visualization. Ground-based images including route panoramas, scene tunnels, panoramic views, and spherical views are acquired in the area and are associated with geospatial information. In this paper, we plan distributed locations and paths in the urban area based on the visibility, image p...

On the Use of Linguistic Ontologies for Accessing and Indexing Distributed Digital Libraries

Digital libraries containing vast amounts of information are expected to soon be readily available over wide area distributed networks. As the amount of data available increases, the problem of indexing the resources efficiently and accessing them easily will also increase. Several approaches have been proposed previously: these range from simple keyword retrieval (as in DIALOG, and MEDL...

Web Archiving and Digital Libraries (WADL) 2016: Highlights and Introduction to this Special Issue

This workshop, reported in the following 12 papers, explored the integration of Web archiving and digital libraries, so the complete life cycle involved was introduced: creation/authoring, uploading/publishing in the Web (2.0), (focused) crawling, indexing, exploration (searching, browsing), archiving (of events), etc. It included particular coverage of current topics of interest, e.g., big dat...

Publication date: 2007